The Clean and the Green

Investigating the Relationship Between Income and Sanitation

Author

Giovani Thai, William Gladden, Soren Paetau

Code
library(tidyverse)
library(broom)
library(ggridges)
library(RColorBrewer)
library(knitr)
library(DT)
library(kableExtra)
library(gganimate)
library(gifski)
library(patchwork)

Introduction

Code
sanitation.data <- read_csv("at_least_basic_sanitation_overall_access_percent.csv")
income.data <- read_csv("income_per_person_gdppercapita_ppp_inflation_adjusted.csv")
colnames(income.data) <- c("country", as.numeric(1799:2049))

We are interested in the relationship between income and sanitation levels. One might assume that as income per person increases, so too will the access to basic necessities, including hygienic products and facilities, such as bathrooms and showers. We aim to analyze data relating to these factors, outlined in the process below.

Code
income.data |>
  slice_head(n = 1000) |>
  select(country, `1999`:`2002`) |>
  datatable(class = "cell-border",
            rownames = FALSE,
            caption = "Table 1: Preview of Income Data Set, 1999-2002.",
            filter = "top")

Our income data set measures the total GDP of a country per person for 216 countries between the years of 1892 and 2049, with units in fixed 2017 prices. This number is adjusted for PPP, or the differences between costs of living and essential products, across countries. The data comes from the World Bank, the Maddison Project Database, and the Penn World Table. Historical estimates were used for early years; forecasts from the IMF’s Economic Outlook were used to project income in the future.

Code
sanitation.data |>
  slice_head(n = 1000) |>
  select(1:5) |>
  datatable(class = "cell-border",
            rownames = FALSE,
            caption = "Table 2: Preview of Sanitation Data Set, 1999-2002.",
            filter = "top")

Our sanitation data set measures the percentage of people (living in both urban and rural settings) who use at least basic sanitation services not shared with other households. This includes flushing and pouring to piped sewer systems, septic, and ventilation for pit latrines, such as squat toilets.

The data was collected by the World Health Organization and UNICEF, falling between the years of 1999 to 2019.

We hypothesize that there will be a positive relation between income per person and percentage of people with basic sanitation at their disposal. We would like to be able to predict a country’s percentage of people who have access to basic sanitation services based on that country’s income per person.

We continue our analysis by merging these two data sets to focus on this hypothesized relationship between 1999 and 2019.

The cleaning process requires removing instances of “k” from some data entries, representing thousands of dollars. For example, we had to perform the following conversion: \(12.3k \to 12300\) for such entries. Notice that the decision to drop NA values was made, since ggplot and lm will drop NA values and we have an abundance of data entries. This decision will be more convenient later down the road when taking advantage of the country column.

Code
convert_num <- function(x){ #converts "12.3k" to 12300,  
  temp <- x
  if(str_detect(x, "k$")){
    temp <- x |> 
      str_replace_all("[^[:digit:]]", "") |> #gross but it just replaces any non-number with ""
      as.numeric() * 100
  }
  return(temp)
}

#data cleaning + joining
sanitation.clean <- sanitation.data |>
  drop_na() |>
  pivot_longer(cols = `1999`:`2019`,
               names_to = "year",
               values_to = "sanitation")

income.clean <- income.data |>
  drop_na() |> #pretty bold step to remove any rows with na vals, 
               #but due to the abundance of data seems reasonable
  select(c(country, `1999`:`2019`)) |>
  pivot_longer(cols = `1999`:`2019`,
               names_to = "year",
               values_to = "income") |>
  mutate(income = as.numeric(map(income, convert_num)))#converts characters,
  #able to convert after pivot, b/c data is all seen as strings!

Below is the final, merged, and cleaned data set.

Code
full.data <- inner_join(sanitation.clean, income.clean) |>
  mutate(year = as.numeric(year))

full.data |>
  slice_head(n = 1000) |>
  datatable(class = "cell-border",
            rownames = FALSE,
            caption = "Table 3: Preview of Final Data Set.",
            filter = "top")

Linear Regression

We aim to piece together the relationship between income and sanitation levels with a linear model. Our goal is to develop a model that predicts the percent of people with basic sanitation (response) from adjusted income per person (explanatory).

We first visualize the relation between these two variables.

Code
data.plot <- full.data |>
  ggplot(mapping = aes(x = income, y = sanitation) ) +
  geom_point(alpha = 0.5) +
  scale_y_continuous(limits = c(0,100),
                     breaks = seq(0,100,25)) +
  labs(title = "Income versus Sanitation, 1999-2019",
       subtitle = "Percentage of People with (at least) Basic Sanitation Services",
       x = "Adjusted Income per Person (in 2017 $)",
       y = "") 

data.plot

Although the above scatter plot is very messy, we can see a general pattern: countries with low adjusted income per person have lower percentages of people with basic sanitation, whereas countries with high income per person rarely have less than 90 percent of people with basic sanitation.

Code
us.plot <- full.data |>
    mutate(year = as.numeric(year)) |>
    filter(country == "United States") |>
    ggplot(mapping = aes(x = income, y = sanitation)) +
    geom_point() +
    transition_time(year) +
    labs(title = "United States Income vs. Sanitation, 1999-2019",
        subtitle = "Year: {floor(frame_time)}",
        x = "Adjusted Income Per Person (in 2017 $)",
        y = "Percentage of People with (at least) Basic Santitation") +
    shadow_mark(alpha = 1, 
                size = 1) 

w = animate(us.plot, duration = 5, fps = 20, renderer = gifski_renderer(), end_pause = 40)
anim_save("animated_US_plot.gif", animation = w)

cuba.plot <- full.data |>
    mutate(year = as.numeric(year)) |>
    filter(country == "Cuba") |>
    ggplot(mapping = aes(x = income, y = sanitation)) +
    geom_point() +
    transition_time(year) +
    labs(title = "Cuba Income vs. Sanitation, 1999-2019",
        subtitle = "Year: {floor(frame_time)}",
        x = "Adjusted Income Per Person (in 2017 $)",
        y = "Percentage of People with (at least) Basic Santitation") +
    shadow_mark(alpha = 1, 
                size = 1)

z = animate(cuba.plot, duration = 5, fps = 20, renderer = gifski_renderer(), end_pause = 40)
anim_save("animated_Cuba_plot.gif", animation = z)

However, there are instances where countries with very low income per person have high percentages of people with basic sanitation. For example, Cuba has a relatively low adjusted income per person in the years between 1999 and 2019, but still enjoyed a sanitation percentage above 88 percent; this continued to rise as income per person increased.

Another exception to the rule has some countries drop to low income per person while maintaining similar sanitation levels. There was a period in time when the United States’ adjusted income per person dropped below $20,000; yet the percentage of people with basic sanitation never drops below 99 percent within this time frame. Despite an obvious general trend, further investigation is needed to stake a stronger claim.

Our next step is to look at how the relationship between our variables of interest changed over time.

Code
temp.plot <- full.data |>
  mutate(year = as.numeric(year)) |>
  ggplot(mapping = aes(x = income,
                       y = sanitation)) +
  geom_point(show.legend = F) +
  transition_time(year) +
  labs(title = "Income versus Sanitation, 1999-2019",
       subtitle = "Year: {floor(frame_time)}",
       x = "Adjusted Income per Person (in 2017 $)",
       y = "Percentage of People with (at least) Basic Sanitation Services") +
  shadow_mark(keep_layer = T, size = 0.1, alpha = 0.5) +
  enter_fade() +
  exit_fade()

anim_time = animate(temp.plot, duration = 10, fps = 10, renderer = gifski_renderer())
anim_save("animated_time_plot.gif", animation = anim_time)

The plot above, specifically the shadow trails on each point, tells us that almost every data point experiences an increase in percentage of people with basic sanitation services, coupled with overall increases in adjusted income per person. This gives us more confidence that our hypothesized relationship between our variables is correct, but more in depth analysis is needed before we can conclude anything.

Upon fitting a simple linear regression model for percent of people who have access to basic sanitation (\(Y\)) based on adjusted income per person (\(X\)), we acquired the following regression equation:

\[\hat{Y} = 55.75 + 0.0009699X\]

Code
data.plot + geom_smooth(method = "lm", color = "red")

Our simple linear regression equation tells us that a country with an average adjusted income of $0 will be predicted to have a 55.75 percent of its people with access to basic sanitation services. Note that this conclusion is an extrapolation since our data does not contain any information about any countries with a $0 adjusted income per person. Our equation also tells us for every ten thousand dollar increase in adjusted income per person, a country’s average sanitation percentage increases by 9.699 points.

To assess the validity of our linear model, we look at the variance in observed sanitation percentage, predicted sanitation percentage, and residuals.

Code
model_var <- augment(data.fit) |>
  summarise(var.resp = var(sanitation),
         var.fitted = var(.fitted),
         var.resid =var(.resid))

model_var |>
  kbl(
    col.names = c("Variance in Sanitation", "Variance in Predicted", "Variance in Residuals"),
    caption = "Table 4: Variance in Observed/Predicted/Residual Sanitation."
  ) |>
  kable_styling(bootstrap_options = "bordered",
                position = "center",
                font_size = 18,
                html_font = "helvetica"
                )
Table 4: Variance in Observed/Predicted/Residual Sanitation.
Variance in Sanitation Variance in Predicted Variance in Residuals
942.5548 313.8511 628.7037

The proportion of variability accounted for by our model was 0.3329791. Our model does a good enough job to communicate the general idea that the more money people are making, the make likely it is that they will have access to basic sanitation services. However, the variable we are hoping to predict is a percentage which means it can not exceed 100%, exposing the flaw that our model will predict impossible to reach sanitation percentages for high enough average income inputs. This gives us problems regarding what we can and can not conclude due to risk of extrapolating further than our data provides. There is not much we can do about this using a simple linear model as we would have to introduce a polynomial regression equation (say, through a logarithmic transformation) to be more precise.

Simulation

To further investigate the quality of our model, we created a simulated data set using our regression model and the actual adjusted income amounts as inputs. Below are plots comparing the observed data and our simulated data.

Code
data_predict <- predict(data.fit) #predicted values
data_sig <- sigma(data.fit) #s or residual std error

noise <- function(x, mean = 0, sd){ #stolen from Prof Robinson :,)
  x + rnorm(length(x), 
            mean, 
            sd)
}

sim_response <- tibble(sanitation = noise(data_predict, 
                                           sd = data_sig)
                      )
Code
full.dist <- full.data |>
  ggplot(aes(x = sanitation)) +
  geom_histogram() +
  labs(x = "Observed Sanitation Index",
       y = "",
       subtitle = "Count") +
  theme_bw()

sim.dist <- sim_response |>
  ggplot(aes(x = sanitation)) +
  geom_histogram() +
  labs(x = "Simulated Sanitation  Index",
       y = "",
       subtitle = "Count") +
  theme_bw()

full.dist + sim.dist

Above are histograms of basic sanitation percentage in the observed data and our simulated data, which highlight the poor quality of our model. While our observed data is skewed to the left, the simulated data follows a normal distribution. Furthermore, they highlight the weakness that our model outputs sanitation percentages that are more than 100 percent, which make some conclusions meaningless.

Code
sim.plot <- sim_response |>
  ggplot(mapping = aes(x = full.data$income, y = sanitation) ) +
  geom_point(alpha = 0.5) +
  scale_y_continuous(limits = c(0,200),
                     breaks = seq(0,200,25)) +
  labs(title = "Predicted Sanitation versus Income, 1999-2019",
       subtitle = "Percentage of People with (at least) Basic Sanitation Services",
       x = "Adjusted Income per Person (in 2017 $)",
       y = "") 

data.plot2 <- full.data |>
  ggplot(mapping = aes(x = income, y = sanitation) ) +
  geom_point(alpha = 0.5) +
  scale_y_continuous(limits = c(0,200),
                     breaks = seq(0,200,25)) +
  labs(title = "Income versus Sanitation, 1999-2019",
       subtitle = "Percentage of People with (at least) Basic Sanitation Services",
       x = "Adjusted Income per Person (in 2017 $)",
       y = "") 

sim.plot + data.plot2

Comparing plots of percentage of people with basic sanitation and adjusted income per person for the observed and simulated data provides us more evidence of the flaws of our model. Even though the simulated data communicates the general idea of the observed data, it seems that our model oversimplifies the relationship between our variables.

Code
nsims <- 1000
sims <- map_dfc(.x = 1:nsims,
                .f = ~ tibble(sim = noise(data_predict, 
                                          sd = data_sig)
                              )
                )
colnames(sims) <- colnames(sims) |> 
  str_replace(pattern = "\\.\\.\\.",
                  replace = "_")

temp <- full.data |> 
  select(sanitation) |>
  bind_cols(sims)
  
sim_r_sq <- temp |> 
  map(~ lm(sanitation ~ .x, data = temp)) |> 
  map(glance) |> 
  map_dbl(~ .x$r.squared)#removes first col

tibble(sims = sim_r_sq[-1]) |> 
  ggplot(aes(x = sims)) + 
  geom_histogram(binwidth = 0.0025) +
  labs(titel = "Distribution of $R^2$",
       x = expression("Simulated"~ R^2),
       y = "",
       subtitle = "Number of Simulated Models")

discuss implications of above plot how well does our model generate data similar to observed

Log Transformed Model

We wanted to attempt a log transformation of adjusted income per person in order to improve our model. Upon doing this, we received the following regression equation.

Code
temp.data <- full.data
temp.data$income <- log(temp.data$income, base = 2) #transforming by log base 2

temp.fit <- lm(sanitation~income, data = temp.data) #fitting new model

\[\hat{Y} = -102.2536 + 13.32\text{log}_2(X)\]

Code
temp.data |>
  ggplot(mapping = aes(x = income, y = sanitation) ) +
  geom_point() +
  geom_smooth(method = lm, level = 0, color = "red") +
  scale_y_continuous(limits = c(0,100),
                     breaks = seq(0,100,25)) +
  labs(title = "$Log Transformed Income versus Sanitation, 1999-2049",
       subtitle = "Percentage of People with (at least) Basic Sanitation Services",
       x = "Log Adjusted Income per Person (in 2017 $)",
       y = "")

If we double adjusted income per person, our new model predicts a 13.31 percent increase in the mean value of percentage of people with access to basic sanitation. The proportion of variability accounted for by this model was 0.5651. This is much better than the non transformed model, but still not great. This model still outputs sanitation percentages over 100 percent, but does it much less often. The data does seem to curve, so the introduction of a polynomial regression equation might improve the model; but for our purposes, this is sufficient.

Code
data_predict <- predict(temp.fit) #predicted values
data_sig <- sigma(temp.fit) #s or residual std error

noise <- function(x, mean = 0, sd){ #stolen from Prof Robinson :,)
  x + rnorm(length(x), 
            mean, 
            sd)
}

sim_response <- tibble(sanitation = noise(data_predict, 
                                           sd = data_sig)
                      )

We decided another simulation was needed on this model to see how much better it was compared to our previous one.

Code
temp.dist <- temp.data |>
  ggplot(aes(x = sanitation)) +
  geom_histogram() +
  labs(x = "Observed Sanitation Index",
       y = "",
       subtitle = "Count") +
  theme_bw()

sim.dist <- sim_response |>
  ggplot(aes(x = sanitation)) +
  geom_histogram() +
  labs(x = "Simulated Sanitation  Index",
       y = "",
       subtitle = "Count") +
  theme_bw()

temp.dist + sim.dist

Talk about improvements of this model in simulating data briefly about its flaws

Code
sim.plot <- sim_response |>
  ggplot(mapping = aes(x = temp.data$income, y = sanitation) ) +
  geom_point(alpha = 0.5) +
  scale_y_continuous(limits = c(0,200),
                     breaks = seq(0,200,25)) +
  labs(title = "Predicted Sanitation versus Log Income, 1999-2019",
       subtitle = "Percentage of People with (at least) Basic Sanitation Services",
       x = "Adjusted Income per Person (in 2017 $)",
       y = "") 

data.plot2 <- temp.data |>
  ggplot(mapping = aes(x = income, y = sanitation) ) +
  geom_point(alpha = 0.5) +
  scale_y_continuous(limits = c(0,200),
                     breaks = seq(0,200,25)) +
  labs(title = "Log Income versus Sanitation, 1999-2019",
       subtitle = "Percentage of People with (at least) Basic Sanitation Services",
       x = "Adjusted Income per Person (in 2017 $)",
       y = "") 

sim.plot + data.plot2

Same thing as above in terms of these plots

Code
nsims <- 1000
sims <- map_dfc(.x = 1:nsims,
                .f = ~ tibble(sim = noise(data_predict, 
                                          sd = data_sig)
                              )
                )
colnames(sims) <- colnames(sims) |> 
  str_replace(pattern = "\\.\\.\\.",
                  replace = "_")

temp <- temp.data |> 
  select(sanitation) |>
  bind_cols(sims)
  
sim_r_sq <- temp |> 
  map(~ lm(sanitation ~ .x, data = temp)) |> 
  map(glance) |> 
  map_dbl(~ .x$r.squared)#removes first col

tibble(sims = sim_r_sq[-1]) |> 
  ggplot(aes(x = sims)) + 
  geom_histogram(binwidth = 0.0025) +
  labs(x = expression("Simulated"~ R^2),
       y = "",
       subtitle = "Number of Simulated Models")

discuss implications of above plot how well does our model generate data similar to observed

References

Income Data Set. Collected by World Bank, Maddison Project, Penn World Table (University of Groningen).

Sanitation Data Set. Collected by World Health Organization, UNICEF.